docs(simd): TD-SIMD-8 — F16 honesty + matrix audit for missing lanes #178

Merged: AdaWorldAPI merged 1 commit into master from claude/pr-x-td8-f16-honesty on May 20, 2026
@AdaWorldAPI (Owner)

Summary

TD-SIMD-8 (F16 honesty) + matrix audit for missing lane wrappers (U16/U32/U64, I4/8/16/32/64, F32 — user request).

F16 honesty

  • src/simd_half.rs::F16x16: docstring now explicitly discloses the scalar [u16; 16] storage and directs hot loops to core::simd::f16x16 (under nightly-simd) or to fp32 with conversion at boundaries.
  • Disambiguates from simd_avx2::F16Scaler — that's a scaling context for range-normalizing values before f16 encoding, NOT the F16x16 SIMD type. Both files now cross-reference each other.
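
To make the "scalar storage, fp32 in the hot loop" advice concrete, here is a minimal sketch. It is not the code in `src/simd_half.rs`: the type name `F16x16Sketch` and both helper names are hypothetical, and only the decode direction (f16 bits to f32) is shown.

```rust
/// Decode IEEE 754 binary16 bits into an f32 (pure software path).
pub fn f16_bits_to_f32(h: u16) -> f32 {
    let sign = ((h as u32) & 0x8000) << 16;
    let exp = ((h >> 10) & 0x1F) as u32;
    let frac = (h & 0x03FF) as u32;
    let bits = match (exp, frac) {
        // Zero and subnormals: the value is frac * 2^-24, which is exact in f32.
        (0, _) => sign | (frac as f32 * (1.0 / 16_777_216.0)).to_bits(),
        // Infinity and NaN: all-ones exponent, payload shifted into place.
        (0x1F, _) => sign | 0x7F80_0000 | (frac << 13),
        // Normal numbers: rebias the exponent from 15 to 127.
        _ => sign | ((exp + 112) << 23) | (frac << 13),
    };
    f32::from_bits(bits)
}

/// Hypothetical stand-in for a scalar-storage F16x16: sixteen raw f16 bit patterns.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct F16x16Sketch(pub [u16; 16]);

impl F16x16Sketch {
    /// Convert once at the boundary; callers then do their math in f32.
    pub fn to_f32_array(self) -> [f32; 16] {
        self.0.map(f16_bits_to_f32)
    }

    /// Example hot-loop operation carried out entirely in f32.
    pub fn sum_f32(self) -> f32 {
        self.to_f32_array().iter().sum()
    }
}
```

The point of the shape is that the `[u16; 16]` array is only a storage format: every arithmetic step happens after widening, so no one mistakes the wrapper for a hardware f16 vector.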

Matrix corrections

Cross-referenced every pub struct *x* in simd_avx512.rs, simd_avx2.rs, simd_neon.rs, simd_nightly/mod.rs against the parity matrix. Found these gaps:

| Change | Reason |
| --- | --- |
| F32x8 v3: ❌ → ✅ `__m256` (in simd_avx512) | src/simd.rs:294 already imports it on v3 path; it's AVX (not AVX-512), works Sandy Bridge+ |
| F64x4 v3: ❌ → ✅ `__m256d` (in simd_avx512) | Same as F32x8 |
| U32x8 row added | nightly-only; missing on x86 / aarch64 / scalar |
| U64x4 row added | nightly-only |
| U16x16 row added | missing on EVERY backend (incl. nightly) |
| I32x8 row added | missing on EVERY backend (incl. nightly) |
| I64x4 row added | missing on EVERY backend (incl. nightly) |
| F32Mask8 row added | declared as F32Mask8Scalar in simd_scalar; not surfaced through crate::simd::* |
| F64Mask4 row added | declared as F64Mask4Scalar in simd_scalar; not surfaced |

Sub-byte lanes section added

I4 / U4 (4-bit nibbles) used by INT4 quantized inference (Q4_0, Q4_K, GPTQ, AWQ). No first-class wrapper exists anywhere; consumers pack two nibbles per byte and operate through U8x64 with shr_epi16 + & 0x0F masks. Documents the hardware story (AVX-512 VBMI2 VPCOMPRESSB and AVX-512 IFMA VPMADD52 on x86; shr+mask on aarch64). Tracked as TD-SIMD-11 if a consumer files for it.
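
The shift-and-mask idiom described above can be sketched in scalar Rust. The function names are hypothetical, and the nibble order (low nibble first) is an assumption for illustration; real formats such as Q4_0 define their own layouts.

```rust
/// Pack two 4-bit values into one byte (low nibble first; an assumed layout).
pub fn pack_nibbles(lo: u8, hi: u8) -> u8 {
    (lo & 0x0F) | ((hi & 0x0F) << 4)
}

/// Unpack every byte into two unsigned 4-bit lanes using mask and shift.
pub fn unpack_u4(packed: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(packed.len() * 2);
    for &byte in packed {
        out.push(byte & 0x0F); // low nibble: mask only
        out.push(byte >> 4); // high nibble: shift right
    }
    out
}

/// Reinterpret an unsigned nibble as a signed I4 value in -8..=7.
pub fn sign_extend_i4(nibble: u8) -> i8 {
    // Move the nibble into the top four bits, then arithmetic-shift back down.
    ((nibble << 4) as i8) >> 4
}
```

A vector version would apply the same two operations to all 64 bytes of a `U8x64` at once, which is why a dedicated I4/U4 wrapper has not been strictly necessary so far.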

TD-SIMD-8 row updated

§5 entry now points at src/simd_half.rs:123 (the actual F16x16 polyfill) rather than the unrelated F16Scaler at simd_avx2.rs:2566. Documents the three remediation options: (a) wire _mm256_cvtph_ps under target_feature = "f16c" (Ivy Bridge+; all AVX-512 hosts), (b) F16x16Scalar alias to make scalar nature explicit at consumer call sites, (c) type-level doc-warning. ~80 LoC estimate.
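
Remediation option (a) might look roughly like the sketch below. It is an illustration under stated assumptions, not the planned implementation: the names `widen8` and `widen8_f16c` are hypothetical, it uses runtime detection rather than a compile-time `target_feature` gate, and it converts 8 lanes per call, so a 16-lane F16x16 would need two such conversions.

```rust
/// Widen eight f16 bit patterns to f32 with the F16C instruction set.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx,f16c")]
unsafe fn widen8_f16c(src: &[u16; 8]) -> [f32; 8] {
    use std::arch::x86_64::{__m128i, _mm256_cvtph_ps, _mm256_storeu_ps, _mm_loadu_si128};
    let mut out = [0.0f32; 8];
    // SAFETY: `src` holds exactly 16 readable bytes and `out` exactly 32
    // writable bytes; both accesses use unaligned load/store intrinsics.
    unsafe {
        let halves = _mm_loadu_si128(src.as_ptr() as *const __m128i);
        let wide = _mm256_cvtph_ps(halves);
        _mm256_storeu_ps(out.as_mut_ptr(), wide);
    }
    out
}

/// Returns `Some` when the hardware path is available, `None` otherwise
/// (a real wrapper would fall back to a scalar conversion instead).
pub fn widen8(src: &[u16; 8]) -> Option<[f32; 8]> {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("f16c") {
            // SAFETY: both required CPU features were just detected at runtime.
            return Some(unsafe { widen8_f16c(src) });
        }
    }
    let _ = src; // keeps the parameter used on non-x86_64 targets
    None
}
```

Returning `Option` keeps the sketch honest about hosts without F16C; options (b) and (c) are naming and documentation changes and need no code of this kind.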

Test plan

  • Docs-only — cargo check paths unchanged.
  • cargo fmt --check clean (no Rust code changed beyond two doc comments).
  • CI green — no behavior change.

Generated by Claude Code

# F16 honesty (TD-SIMD-8)

`src/simd_half.rs` F16x16: docstring now explicitly discloses scalar
storage and routes hot loops to `core::simd::f16x16` (under
`nightly-simd`) or to fp32 with conversion at boundaries. Disambiguates
from `simd_avx2::F16Scaler` — a scaling CONTEXT for range-normalizing
values before f16 encoding, not the F16x16 SIMD type. Both files cross-
reference each other so a future reader doesn't repeat the confusion.

`src/simd_avx2.rs` F16Scaler: docstring strengthened with the same
disambiguation note.

# Matrix audit (user request)

Cross-referenced every `pub struct *x*` in simd_avx512.rs, simd_avx2.rs,
simd_neon.rs, simd_nightly/mod.rs against the parity matrix in the
architecture doc. Corrections:

- **F32x8 / F64x4 v3 column: ❌ → ✅ `__m256`/`__m256d` (in `simd_avx512`)**.
  The dispatch at `src/simd.rs:294` already imports these from
  simd_avx512 on the v3 / AVX2 path. They're AVX (not AVX-512), so they
  work on every Sandy Bridge+ host. The matrix was stale.
- **U32x8, U64x4 rows added** — nightly-only currently; ❌ on x86 +
  aarch64 + scalar. core::simd has them via `simd_nightly`.
- **U16x16, I32x8, I64x4 rows added** — missing across EVERY backend
  including nightly. Theoretical 256-bit shapes no consumer has reached
  for yet.
- **F32Mask8 / F64Mask4 rows added** — declared in simd_scalar as
  `F32Mask8Scalar` / `F64Mask4Scalar` (rename came from a duplicate-
  decl conflict on i686); not surfaced through `crate::simd::*`. AVX-512
  has them natively via `__mmask8` but they're not typed.
- **Sub-byte lanes section added** — I4 / U4 lanes used by INT4
  quantized inference (Q4_0, Q4_K, GPTQ, AWQ). No first-class wrapper;
  consumers pack 2× nibbles per byte and operate through U8x64 + shr/
  mask. Documents the hardware story (AVX-512 VBMI2 `VPCOMPRESSB` on
  x86; shr+mask trick on aarch64). Tracked as TD-SIMD-11 if a consumer
  files for it.

TD-SIMD-8 description updated in §5 to point at `simd_half.rs:123` (the
actual F16x16 polyfill) rather than `simd_avx2.rs:2566` (the unrelated
F16Scaler scaling utility).

AdaWorldAPI merged commit 2f096d3 into master on May 20, 2026 (16 checks passed).

AdaWorldAPI pushed a commit that referenced this pull request on May 20, 2026:
…ss all backends

PR #178's matrix audit surfaced five 256-bit int lane types that were
either entirely missing or stranded in `simd_nightly` only. Adds them
across every backend so `crate::simd::{U16x16, U32x8, U64x4, I32x8,
I64x4}` resolves uniformly on v3 / v4 / native / nightly / scalar /
aarch64 paths.

`src/simd_avx2.rs`
  + 5× `avx2_int_type!` instantiations producing scalar-storage
    `[$elem; $lanes]` polyfills (align 64). Same macro pattern as the
    existing 512-bit polyfills (U8x64, U16x32, …). Native AVX2 `__m256i`
    upgrades are TD-SIMD-3.
  + 5× lowercase aliases (`u16x16 = U16x16`, etc.) matching the
    std::simd convention used by every other lane type in the file.
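
For readers unfamiliar with the polyfill pattern, a macro of this kind could look like the sketch below. The real `avx2_int_type!` is not shown in this PR, so the macro name `int_lane_sketch!`, its argument order, the method set, and the wrapping-add reduction are all assumptions; only the scalar `[$elem; $lanes]` storage, the 64-byte alignment, and the lowercase aliases come from the description above.

```rust
/// Hypothetical macro: generates a scalar-storage lane type plus a lowercase alias.
macro_rules! int_lane_sketch {
    ($name:ident, $alias:ident, $elem:ty, $lanes:expr) => {
        #[derive(Clone, Copy, Debug, PartialEq)]
        #[repr(C, align(64))]
        pub struct $name(pub [$elem; $lanes]);

        impl $name {
            pub const LANES: usize = $lanes;

            /// Broadcast one value to every lane.
            pub fn splat(v: $elem) -> Self {
                Self([v; $lanes])
            }

            pub fn from_array(a: [$elem; $lanes]) -> Self {
                Self(a)
            }

            pub fn to_array(self) -> [$elem; $lanes] {
                self.0
            }

            /// Horizontal sum; wrapping on overflow is an assumption of this sketch.
            pub fn reduce_sum(self) -> $elem {
                let mut acc: $elem = 0;
                for x in self.0.iter() {
                    acc = acc.wrapping_add(*x);
                }
                acc
            }
        }

        /// Lowercase alias in the std::simd naming style.
        #[allow(non_camel_case_types)]
        pub type $alias = $name;
    };
}

int_lane_sketch!(U16x16, u16x16, u16, 16);
int_lane_sketch!(U32x8, u32x8, u32, 8);
int_lane_sketch!(U64x4, u64x4, u64, 4);
int_lane_sketch!(I32x8, i32x8, i32, 8);
int_lane_sketch!(I64x4, i64x4, i64, 4);
```

Because the storage is a plain array, the compiler is free to auto-vectorize simple loops, but nothing here guarantees `__m256i` code generation; that is the gap TD-SIMD-3 refers to.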

`src/simd_scalar.rs`
  + 5× `impl_int_type!` instantiations mirroring the AVX2 polyfills
    above. Consumers on non-x86/non-aarch64 (wasm32, riscv, thumb)
    reach the same type names through `crate::simd::*`.
  + Lowercase aliases.

`src/simd_avx512.rs`
  + Re-export of the new types from `simd_avx2` so the v4 dispatch
    arm in `simd.rs` can surface them without forking the macro into
    this file. Both files are already gated on `target_arch = "x86_64"`,
    so the re-export is cheap. Native `__m256i` upgrades here are
    TD-SIMD-3 (same story as the v3 polyfills).

`src/simd_nightly/u_word_types.rs`
  + `U16x16` wrapper backed by `core::simd::u16x16`. Same API surface
    as the existing 32-/16-/8-lane wrappers — splat, from_slice,
    from_array, to_array, copy_to_slice, reduce_{sum,min,max},
    simd_min/max, cmpeq_mask, cmpgt_mask, Default.
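
As a reference for what the comparison methods compute, the following scalar sketch builds one bit per lane. The return type of the real `cmpeq_mask` / `cmpgt_mask` is not stated in this commit, so the `u16` bitmask, the bit order (lane 0 in bit 0), and the `_ref` function names are assumptions.

```rust
/// Scalar reference: bit `i` is set when lane `i` of `a` equals lane `i` of `b`.
pub fn cmpeq_mask_ref(a: &[u16; 16], b: &[u16; 16]) -> u16 {
    let mut mask = 0u16;
    for lane in 0..16 {
        if a[lane] == b[lane] {
            mask |= 1 << lane;
        }
    }
    mask
}

/// Scalar reference: bit `i` is set when lane `i` of `a` is greater than lane `i` of `b`.
pub fn cmpgt_mask_ref(a: &[u16; 16], b: &[u16; 16]) -> u16 {
    let mut mask = 0u16;
    for lane in 0..16 {
        if a[lane] > b[lane] {
            mask |= 1 << lane;
        }
    }
    mask
}
```

A reference like this is mainly useful in tests, where each backend's wrapper can be checked lane by lane against the scalar result.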

`src/simd_nightly/i_word_types.rs`
  + `I32x8` and `I64x4` wrappers backed by `core::simd::{i32x8, i64x4}`.
    Same API surface as siblings; PartialEq via array compare.

`src/simd_nightly/mod.rs`
  + Re-exports for the three new types + lowercase aliases.

`src/simd.rs`
  + All 5 dispatch arms (nightly, v4, v3, aarch64, scalar fallback)
    updated to surface the new types through `crate::simd::*`.

`.claude/knowledge/simd-dispatch-architecture.md`
  + Parity matrix updated — the five rows previously marked ❌ across
    most backends now show 🟠 polyfill (v3, v4-via-v3, scalar) /
    🔵 (nightly via `core::simd`).

Verified: `cargo check` clean under default v3 features and under
`-Ctarget-cpu=x86-64-v4` (via `CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUSTFLAGS`
+ explicit `--target` so build scripts don't SIGILL on non-AVX-512
runners — same pattern as the tier4-avx512-check job).